This is a general template on how you can analyze your RNA-Seq data using the DESeq2-package and the DESeqtools-package, which provides additional plots and some convenient wrappers.
The aim was to write a script that can be used as a template for standard bulk RNA-seq analysis. For this example analysis, a RNASeq dataset of human isolated monoctyes is used, which were differentiated to monocyte-derived dendritic cells (moDC) in presence or absence of interleukin-10 (IL-10). In addition, cell maturation by application of LPS leads to a total number of 4 conditions with 3 samples each. The raw data of the 12 samples is accessible under https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92852.
The data has been preprocessed using kallisto.
This example analysis script covers:
Here, you are supposed to write down all necessary information about the project you are working on! We think this is very important for collaboration partners or other researches working with the data (after your time) to know the background of the project and the preprocessing steps performed!
The data to be analyzed has to be in a certain structure, which is pretty much the structure it has coming out of out the preprocessing pipeline. Additionally, you need to download annotation files from a public sciebo folder.
In case you observed an obvious batch effect in the data during the exploratory data analysis, this part of the script allows you to estimate the influence of both known and hidden batch effects in a PCA. Batch variables can later be included in the DESeq2 design.
Some explanations in this script are borrowed from the great vignette of the DESeq2-package. You can find the full documentatio here: http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
First, you need to make sure that you have the devtools package installed. Then, you can install the DESeqtools package from Github, which will automatically install all package dependencies and annotation files.
#install.packages("devtools")
# library("devtools")
#install_github("deniznazs/DESeqtools", ref = "package")
To load the package, run:
library(devtools)
## Warning: package 'devtools' was built under R version 3.6.1
## Loading required package: usethis
devtools::load_all(path = "E:/Deniz/Git/usr/bin/DESeqtools_Deniz")
## Loading DESeqtools
##
## Registered S3 method overwritten by 'enrichplot':
## method from
## fortify.enrichResult DOSE
##
##
## Warning in (function (dep_name, dep_ver = "*") : Need RColorBrewer >= 2.0.0
## but loaded version is 1.1-2
## Loading required package: pheatmap
## Loading required package: DESeq2
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter,
## Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unsplit, which,
## which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## Loading required package: BiocParallel
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply, rowsum
## Loading required package: tidyr
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:S4Vectors':
##
## expand
## Loading required package: ggplot2
## Loading required package: ggrepel
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:IRanges':
##
## space
## The following object is masked from 'package:S4Vectors':
##
## space
## The following object is masked from 'package:stats':
##
## lowess
## Loading required package: ggbeeswarm
## Loading required package: hexbin
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## Loading required package: factoextra
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:Biobase':
##
## contents
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: VennDiagram
## Loading required package: grid
## Loading required package: futile.logger
## Loading required package: openxlsx
## Loading required package: rhdf5
## Loading required package: clusterProfiler
## clusterProfiler v3.12.0 For help: https://guangchuangyu.github.io/software/clusterProfiler
##
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
##
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:DelayedArray':
##
## simplify
## Loading required package: GSEABase
## Loading required package: annotate
## Loading required package: AnnotationDbi
## Loading required package: XML
## Loading required package: graph
##
## Attaching package: 'graph'
## The following object is masked from 'package:XML':
##
## addNode
## Loading required package: RColorBrewer
## Loading required package: ComplexHeatmap
## ========================================
## ComplexHeatmap version 2.0.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
## ========================================
## Loading required package: tximport
## Loading required package: vsn
## Loading required package: genefilter
##
## Attaching package: 'genefilter'
## The following object is masked from 'package:ComplexHeatmap':
##
## dist2
## The following objects are masked from 'package:matrixStats':
##
## rowSds, rowVars
## Loading required package: biomaRt
## Loading required package: limma
##
## Attaching package: 'limma'
## The following object is masked from 'package:DESeq2':
##
## plotMA
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
## Loading required package: sva
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:IRanges':
##
## collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
##
## Attaching package: 'mgcv'
## The following object is masked from 'package:futile.logger':
##
## scat
## Loading required package: IHW
##
## Attaching package: 'IHW'
## The following object is masked from 'package:ggplot2':
##
## alpha
## Loading required package: org.Mm.eg.db
## Loading required package: org.Hs.eg.db
## Loading required package: ggpubr
## Warning: package 'ggpubr' was built under R version 3.6.1
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:VennDiagram':
##
## rotate
## Loading required package: useful
## Warning: package 'useful' was built under R version 3.6.1
## Loading required package: checkmate
##
## Attaching package: 'checkmate'
## The following object is masked from 'package:matrixStats':
##
## anyMissing
## The following object is masked from 'package:Biobase':
##
## anyMissing
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
##
## collapse
## The following object is masked from 'package:biomaRt':
##
## select
## The following objects are masked from 'package:GSEABase':
##
## intersect, setdiff, union
## The following object is masked from 'package:graph':
##
## union
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:matrixStats':
##
## count
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: plotly
## Warning: package 'plotly' was built under R version 3.6.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:biomaRt':
##
## select
## The following object is masked from 'package:ComplexHeatmap':
##
## add_heatmap
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
devtools::document()
## Updating DESeqtools documentation
## Warning: @format [E:\Deniz\Git\usr\bin\DESeqtools_Deniz\R\data.R#244]:
## requires a value
## Writing NAMESPACE
## Loading DESeqtools
## Warning in (function (dep_name, dep_ver = "*") : Need RColorBrewer >= 2.0.0
## but loaded version is 1.1-2
## Writing NAMESPACE
Since the gene ontology files are too large to put them on github, you have to load them in two chunks and combine them.
# for human
GO_hs <- rbind(GO_hs_1, GO_hs_2)
# for mouse
GO_mm <- rbind(GO_mm_1, GO_mm_2)
Please specify important information on the data set.
Scienfitic question:
QC:
Project directory on blades (processed files):
Workflow Summary:
Space for notes concerning the global workflow and quality checks.
Obligatory structure of your directory:
The project directory needs to contain a folder /Data, which contains:
The gene annotation files as well as the GMT files for human and mouse are loaded with the package.
Specify the organism of the data set
organism = "human" # choose "mouse" or "human"
Specify the project directory here:
If you want to run the script on your own data, you can specify the directory and save it unrer “dir”. Then, the output directories are automatically created.
dir <- "E:/Deniz/Git/usr/bin/DESeqtools_Deniz/inst/extdata"
# creating output directories (if not already existing):
dir.create(file.path(dir, "Analysis", "Tables"), recursive = T)
dir.create(file.path(dir, "Analysis", "Plots"), recursive = T)
For demonstation purposes, we take data which is installed with the package and therefore use another file directory that points to the preinstalled files.
dir <- system.file("extdata", package = "DESeqtools")
Define you colour schemes according to your data set and your variables!
For hex colors use: https://www.color-hex.com
# cell_Type
col_cellType <- c("#81c354", "#86BE9E")
names(col_cellType) <- c("moDC", "IL10APC")
# treatment
col_treat = c("#CCBD80","#FFEDA0", "#FEB24C", "#F03B20")
names(col_treat) <- c("undiff", "LPS","IL10", "LPS_IL10")
# donor
col_donor <- c("#66C3D0", "#008DD2","#0066D2")
names(col_donor) <- c("BC2", "BC3","BC5")
# merged: cell_Type and treatment
col_condition <- c(brewer.pal(3, "Blues")[2:3], brewer.pal(3, "Oranges")[2:3])
names(col_condition) <- c("Immature_moDCs","LPStreated_moDCs","Immature_IL10APCs","LPStreated_IL10APCs")
# combine color code into list
ann_colors_new <- list(cell_Type = col_cellType,
treatment=col_treat,
donor = col_donor,
condition=col_condition)
This gene annotation file will be used 1) to map the Ensembl transcript IDs to Ensembl gene IDs during the tximport function and 2) annotate the Ensembl IDs with additional information such as gene symbol or type.
For consistency, this file should have been produced from the .gtf file used for building the kallisto index. It needs to consist of four columns: Gene ENSEMBL ID, Transcript ID, Gene Symbol, Gene Type. ## Load sample table
Now, we read a table containing all available metainformation for the samples. This table needs to be prepared beforehand from information on the sequencing tracker or provided by the experimental partners. Usually, you would read this from your local computer. Here, we just that the example sampletable, that comes with the DESeqtools-package.
sample_table <- example_sampletable
rownames(sample_table) <- sample_table$ID
For the correct order of the samples in later plots, we define the column of our sample table that contains the information of interest and reorder its factor levels.
## Add columns with factors for comparisons in model
sample_table$condition <- factor(sample_table$condition,
levels = c("Immature_moDCs", "Immature_IL10APCs", "LPStreated_IL10APCs", "LPStreated_moDCs"))
# order according to factor of interest
sample_table <- sample_table[order(sample_table$condition),]
# define factor for order of samples in plotting
plot_order <- "condition"
The following visualizations of the data quality check results are only an addition to the much more comprehensive quality plots provided in the MultiQC .html files.
Please check the multiQC output thoroughly before you perfrom any analysis.
Read and visualize fastQC MultiQC output.
For many projects, multiple sequencing runs are necessary to get a satisfactory sequencing depth for all samples. Furthermore, samples can be distributed onto multiple lanes of a flowcell.Since the quality of each single run and lane can differ, we want to look at all fastq files from all runs and lanes seperately.
For more information on fastQC, please check: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
fastq_qc <- read.delim(file.path(dir, "Data", "output", "qc", "multiqc", "multiqc_data",
"multiqc_fastqc.txt"),
stringsAsFactors = F)
fastq_qc$ID <- unlist(lapply(strsplit(fastq_qc$Sample, split = "_"), `[[`, 4)) # sample identifiers
fastq_qc$run <- unlist(lapply(strsplit(fastq_qc$Sample, split = "_"), `[[`, 2)) # run identifiers
fastq_qc$lane <- unlist(lapply(strsplit(fastq_qc$Sample, split = "_"), `[[`, 6)) # lane identifiers
rownames(fastq_qc) <- paste(fastq_qc$run, fastq_qc$ID, fastq_qc$lane, sep = "_")
Visualization: Heatmap of fastQC quality checks
fastQC_HC <- t(fastq_qc[,colnames(fastq_qc) %in% c("per_base_sequence_quality",
"per_sequence_quality_scores",
"per_base_sequence_content",
"per_sequence_gc_content",
"per_base_n_content",
"sequence_length_distribution",
"sequence_duplication_levels",
"overrepresented_sequences",
"adapter_content")])
draw(Heatmap(fastQC_HC,
c("firebrick2", "forestgreen", "goldenrod1"),
column_title = "FastQC results"),
heatmap_legend_side = "left")
Read and visualize kallisto MultiQC output.
Before alignment, multiple fastq files from multiple lanes or runs for a single sample are merged and aligned as one sample. Thus, we have only one alignment score per sample.
Since we want to later visualize the alignment statistics in analysis plots, we add this information to our sample table.
kallisto_qc <- read.delim(file.path(dir, "Data", "output", "kallisto", "multiqc-kallisto", "multiqc_data", "multiqc_kallisto.txt"), stringsAsFactors = F)
# change sample column
kallisto_qc$Sample <- do.call(rbind, strsplit(kallisto_qc$Sample, split = "\\_"))[,1]
# Merge kallisto QC to sample table
sample_table <- merge(sample_table, kallisto_qc, by.x = "ID", by.y = "Sample")
Visualization of pseudoaligned and total reads:
alignstat <-sample_table[,c("ID", "pseudoaligned_reads","total_reads")]
alignstat <-gather(alignstat, "total_reads", "pseudoaligned_reads", key = "type", value = "reads")
ggplot(alignstat, aes(x = reorder(ID,-reads), y = reads, fill = type)) +
geom_bar(stat = "identity", position = "dodge", colour = "black") +
scale_y_continuous(breaks = c(c(5 , 10, 25, 50, 100) * 10^6))+
scale_fill_manual(name = "", values = c("white", "grey")) +
ylab("Reads") +
xlab("Sample IDs") +
geom_hline(yintercept=5 * 10^6, linetype="dashed", color = "red") +
ggtitle("Total reads and pseudoaligned reads") +
theme(axis.text.x = element_text(angle = 70, size = 9, hjust = 1, vjust = 1),
strip.background = element_rect(fill = "white"),
panel.background = element_rect(fill = "white", colour = "black"),
legend.position = "bottom"
)
In case, the data set contains samples of poor quality or simply to few reads, we might want to exclude these samples. In the following chunks, all samples with less than 5.000.000 reads are excluded from the subsequent analysis.
# define read cutoff
samples_to_keep <- sample_table[sample_table$pseudoaligned_reads > 5000000,]$ID
# make reduced sampleTable file
sample_table <- sample_table[which(sample_table$ID %in% samples_to_keep),]
rownames(sample_table) <- sample_table$ID
A new and recommended pipeline for RNA-seq analysis is to use fast transcript abundance quantifiers, such as kallisto or Salmon, upstream of DESeq2, and then to create gene-level count matrices for use with DESeq2 by importing the quantification data using the tximport package.
We use tximport and DESeq2 based on the gene-level estimated counts from Kallisto (Bray, Pimentel, Melsted, Pachter 2016).
Some advantages of using this methods for transcript abundance estimation are:
Full details on the motivation and methods for importing transcript level abundance and count estimates, summarizing to gene-level count matrices and producing an offset which corrects for potential changes in average transcript length across samples are described in (Soneson, Love, and Robinson 2015). Note that the tximport-to-DESeq2 approach uses estimated gene counts from the transcript abundance quantifiers, but not normalized counts.
TXimport imports transcript-level abundance, estimated counts and transcript lengths, and summarizes these into matrices for use with downstream statistical analysis packages such as edgeR, DESeq2, limma-voom. Average transcript length, weighted by sample-specific transcript abundance estimates, is provided as a matrix, which is then used as an offset for different expression of gene-level counts.
# Define path where the Kallisto files are stored
files <- paste(dir, "/Data/output/kallisto/kallisto/", sample_table$ID, "/abundance.h5", sep = "")
# Naming the entries in the vector assures correct column names in the expression tables
names(files) <- sample_table$ID
# choose your tx2gene-annotation based on the annotation you used for kallisto
tx_annotation <- tx_annotation_v27_human
# make sure sample annotation and counts are in the same order
identical(rownames(sample_table), names(files))
## [1] TRUE
# reorder the files to import
files <- files[as.character(sample_table$ID)]
identical(rownames(sample_table), names(files)) # should be true now
## [1] TRUE
# Import samples and perform the distribution of transcripts to genes
txi_kallisto <- tximport(files,
type="kallisto",
tx2gene=tx_annotation[,2:1])
## 1 2 3 4 5 6 7 8 9 10 11 12
## summarizing abundance
## summarizing counts
## summarizing length
The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object called “dds”.
A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into a formula if it is not already). The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.
dds_txi <- DESeqDataSetFromTximport(txi = txi_kallisto,
colData = sample_table,
design = ~ condition)
## using counts and average transcript lengths from tximport
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: * by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and * we increase the speed of the transformation and testing functions within DESeq2.
Here, we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.
Note that more strict filtering to increase power is automatically applied via independent filtering or independent hypothesis weighting on the mean of normalized counts within the results function.
genes_to_keep <- rowSums(counts(dds_txi)) >= 10
dds <- dds_txi[genes_to_keep,]
Number of genes after filtering is: 24016
DESeq2: Estimate variance-mean dependence (2) in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution (1).
In inferential testing a distributional assumption is needed because we want to estimate the probability of extreme events just appearing by chance (e.g. large fold change) from limited replicates. The test statistic of ANOVA (or t-test) follows a Student’s t distribution, a specific case of the normal distribution. However, counts, as produced in RNA-seq experiments, cannot be normally distributed by definition(you can’t have -3 counts, or 12.2 counts). Two distributions for count based data are Poisson, which presumes that the variance and mean are equal, or negative binomial, a.k.a. Gamma-Poisson, which does not. The spread of values among biological replicates is more than given by the one parameter Poisson distribution and it seems to be captured by the two parameter (mean & variance) NB sufficiently well.
Information sharing across genes for variance estimation:
In order to test the differential expression of a gene, we need to estimate its mean and variance for the underlying negative binomial distribution. Inferential methods that treat each gene separately suffer from the high uncertainty of within-group variance estimates. However, this limitation can be overcome by pooling information across genes, specifically, by exploiting assumptions about the similarity of the variances of different genes measured in the same experiment . DESeq2 detects and corrects dispersion estimates through modeling of the dependence of the dispersion on the average mean over all samples.
The standard differential expression analysis steps are wrapped into a single function, DESeq(). The estimation steps performed by this function are described in the manual page for ?DESeq and in the Methods section of the DESeq2 publication (Love, Huber, and Anders 2014).
This function performs a default analysis through the steps:
Estimation of size factors: estimateSizeFactors
Estimation of dispersion: estimateDispersions
Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest
For complete details on each step, see the manual pages of the respective functions.
dds <- DESeq(dds)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
For inspection of the normalized data, we write the normalized counts into a data.frame called “norm_anno”.
norm_anno <- as.data.frame(counts(dds, normalized=T))
unnorm_anno<-as.data.frame(counts(dds, normalized=F))#Deniz
norm_anno$GENEID <- row.names(norm_anno)
unnorm_anno$GENEID <- row.names(unnorm_anno)#Deniz
# add gene annotation extracted from the gtf file
gene_annotation <- tx_annotation[!duplicated(tx_annotation$GENEID),c("GENEID", "SYMBOL", "GENETYPE")]
gene_annotation <- gene_annotation[match(rownames(norm_anno), gene_annotation$GENEID),]
gene_annotation <- gene_annotation[match(rownames(unnorm_anno), gene_annotation$GENEID),]#Deniz
# check if row names of the normalized table and the gene annotation match perfectly
all(rownames(norm_anno) == gene_annotation$GENEID)
## [1] TRUE
all(rownames(unnorm_anno) == gene_annotation$GENEID)#Deniz
## [1] TRUE
# add additional gene annotation downloaded from biomart
if(organism == "human"){
biomart <- biomart_human
} else if( organism == "mouse"){
biomart <- biomart_mouse
}
idx <- match(unlist(lapply(strsplit(gene_annotation$GENEID, split = "[.]"), `[[`, 1)), biomart$Gene.stable.ID)
gene_annotation$DESCRIPTION <- biomart$Gene.description[idx]
gene_annotation$CHR <- biomart$Chromosome.scaffold.name[idx]
# merge expression table and annotation
norm_anno <- merge(norm_anno,
gene_annotation,
by = "GENEID")
rownames(norm_anno) <- norm_anno$GENEID
norm_anno[1:3,c(1:2, (ncol(norm_anno)-5):ncol(norm_anno))]
## GENEID 1553 1567 1568
## ENSG00000000003.14 ENSG00000000003.14 0.00000 1.236363 0.803422
## ENSG00000000419.12 ENSG00000000419.12 312.45092 275.259969 245.334723
## ENSG00000000457.13 ENSG00000000457.13 76.17025 83.453860 76.727878
## SYMBOL GENETYPE
## ENSG00000000003.14 TSPAN6 protein_coding
## ENSG00000000419.12 DPM1 protein_coding
## ENSG00000000457.13 SCYL3 protein_coding
## DESCRIPTION
## ENSG00000000003.14 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000419.12 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457.13 SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## CHR
## ENSG00000000003.14 X
## ENSG00000000419.12 20
## ENSG00000000457.13 1
#Deniz
unnorm_anno <- merge(unnorm_anno,
gene_annotation,
by = "GENEID")
rownames(unnorm_anno) <- unnorm_anno$GENEID
unnorm_anno[1:3,c(1:2, (ncol(unnorm_anno)-5):ncol(unnorm_anno))]
## GENEID 1553 1567 1568 SYMBOL GENETYPE
## ENSG00000000003.14 ENSG00000000003.14 0 1 1 TSPAN6 protein_coding
## ENSG00000000419.12 ENSG00000000419.12 182 282 381 DPM1 protein_coding
## ENSG00000000457.13 ENSG00000000457.13 48 88 131 SCYL3 protein_coding
## DESCRIPTION
## ENSG00000000003.14 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000419.12 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457.13 SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## CHR
## ENSG00000000003.14 X
## ENSG00000000419.12 20
## ENSG00000000457.13 1
#Boxplot and comparison btw unnorm_anno vs norm_anno
######first, create a sample table just taking the sample ID and condition for boxplot
box_sample_table<-sample_table[,c("ID","condition")]#depending on choice, change the condition to smt else
#ordering for the legend labels
legend_order <- factor(unique(sample_table$condition),
levels = unique(sample_table$condition))
legend_order<-levels(legend_order)
######norm_anno boxplot
box_norm_table<-norm_anno[,colnames(norm_anno)%in%box_sample_table$ID]
box_norm_table$GENEID<-rownames(box_norm_table)
###restructuring the table for ggplot2 analysis w/ melt function
box_norm_table<-melt(box_norm_table, id.vars=c("GENEID"))
colnames(box_norm_table)<-c("GENEID","sample","expression")
box_norm_table<-merge(box_norm_table,box_sample_table,by.x="sample",by.y="ID")
box_norm_table$condition<-factor(box_norm_table$condition, levels=legend_order)
p1<-ggplot(box_norm_table, mapping = aes(x=sample , y= expression+1,fill=condition))+
geom_boxplot()+
#geom_boxplot(outlier.shape = NA)+ #to make the outlier points transparent
scale_y_log10()+
scale_fill_manual(values = col_condition)+
#scale_fill_discrete(breaks=fill)+
theme_bw() +
theme(axis.text.x = element_text(angle = 90))+
xlab("Samples") + ylab("normalized expression") +
ggtitle("Normalized gene expression analysis")
######unnorm_anno boxplot
box_unnorm_table<-unnorm_anno[,colnames(unnorm_anno)%in%box_sample_table$ID]
box_unnorm_table$GENEID<-rownames(box_unnorm_table)
###restructuring the table for ggplot2 analysis w/ melt function
box_unnorm_table<-melt(box_unnorm_table, id.vars=c("GENEID"))
colnames(box_unnorm_table)<-c("GENEID","sample","expression")
box_unnorm_table<-merge(box_unnorm_table,box_sample_table,by.x="sample",by.y="ID")
box_unnorm_table$condition<-factor(box_unnorm_table$condition, levels=legend_order)
p2<-ggplot(box_unnorm_table, mapping = aes(x=sample , y= expression+1,fill=condition))+
geom_boxplot()+
#geom_boxplot(outlier.shape = NA)+ #to make the outlier points transparent
scale_y_log10()+ #to change the y-axis limit
scale_fill_manual(values = col_condition)+
#scale_fill_discrete(breaks=fill)+
theme_bw() +
theme(axis.text.x = element_text(angle = 90),legend.position="none")+
xlab("Samples") + ylab("unnormalized expression") +
ggtitle("Unnormalized gene expression analysis")
#combining two boxplots into one plot with common legend
library(ggpubr)
ggarrange(p2, p1, ncol=2, common.legend = TRUE, legend="bottom")
In order to test for differential expression, we operate on raw counts and use discrete distributions. However for other downstream analyses - e.g. for visualization or clustering - it might be useful to work with transformed versions of the count data.
Maybe the most obvious choice of transformation is the logarithm. Since count values for a gene can be zero in some conditions (and non-zero in others), some advocate the use of pseudocounts, i.e. transformations of the form: y=log2(n+n0) where n represents the count values and n0 is a positive constant.
DESeq2 has two alternative approaches that offer more theoretical justification and a rational way of choosing parameters equivalent to n0 above. One makes use of the concept of variance stabilizing transformations (VST) (Tibshirani 1988; Huber et al. 2003; Anders and Huber 2010), and the other is the regularized logarithm or rlog, which incorporates a prior on the sample differences (Love, Huber, and Anders 2014). Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.
The point of these two transformations, the VST and the rlog, is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. Both VST and rlog use the experiment-wide trend of variance over mean, in order to transform the data to remove the experiment-wide trend. Note that we do not require or desire that all the genes have exactly the same variance after transformation. Indeed, in a figure below, you will see that after the transformations the genes with the same mean do not have exactly the same standard deviations, but that the experiment-wide trend has flattened. It is those genes with row variance above the trend which will allow us to cluster samples into interesting groups.
The two functions, vst and rlog have an argument blind, for whether the transformation should be blind to the sample information specified by the design formula. When blind equals TRUE (the default), the functions will re-estimate the dispersions using only an intercept. This setting should be used in order to compare samples in a manner wholly unbiased by the information about experimental groups, for example to perform sample QA (quality assurance) as demonstrated below.
However, blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, using blind dispersion estimation will lead to large estimates of dispersion, as it attributes differences due to experimental design as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula. Note that only the fitted dispersion estimates from mean-dispersion trend line are used in the transformation (the global dependence of dispersion on mean for the entire experiment). So setting blind to FALSE is still for the most part not using the information about which samples were in which experimental group in applying the transformation.
# Please choose to use rlog or VST for the transformation: rlog is recommended for less than 30 samples, vst for more than 30 samples for the sake of computing time
dds_vst <- rlog(dds, blind = TRUE)
# dds_vst <- vst(dds, blind = TRUE)
For later plotting of Heatmaps, it may be useful to have a dataframe of the variance stabilized values as log2-transformed (vst_anno_log2) as well as as normal counts (vst_anno):
vst_anno_log2<-as.data.frame(assay(dds_vst))
vst_anno_log2<-merge(vst_anno_log2,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(vst_anno_log2)<-vst_anno_log2$Row.names
vst_anno_log2$Row.names<-NULL
vst_anno<-as.data.frame(assay(dds_vst))
vst_anno<-2^vst_anno
vst_anno<-merge(vst_anno,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(vst_anno)<-vst_anno$Row.names
vst_anno$Row.names<-NULL
meanSdPlot(as.matrix(assay(dds_vst)), ranks = FALSE)
# choose columns from sampletable for heatmap annotation
plot_annotation <- sample_table[,c("condition","donor","pseudoaligned_reads"), drop = F]
rownames(plot_annotation) <- sample_table$ID
# choose colours for your conditions
ann_colors <- setColours(factorcolours = c("Set1","Set2"))
norm_mean<-mean_function(input=norm_anno,
anno=sample_table,
condition="condition")#Deniz: depending on your desired condition, can change it and put any other column name from sample_table
vst_mean<-mean_function(input=vst_anno,
anno = sample_table,
condition = "condition")#Deniz: depending on your desired condition, can change it and put any other column name from sample_table
library(useful)
corner(vst_mean)
## GENEID Immature_moDCs LPStreated_moDCs
## ENSG00000000003.14 ENSG00000000003.14 2.01630 1.600599
## ENSG00000000419.12 ENSG00000000419.12 283.11881 298.194813
## ENSG00000000457.13 ENSG00000000457.13 91.28087 138.085578
## ENSG00000000460.16 ENSG00000000460.16 37.03521 26.622899
## ENSG00000000938.12 ENSG00000000938.12 2724.02226 922.685747
## Immature_IL10APCs LPStreated_IL10APCs
## ENSG00000000003.14 1.650472 1.630697
## ENSG00000000419.12 307.124451 256.273981
## ENSG00000000457.13 102.633810 97.864680
## ENSG00000000460.16 37.612088 36.503026
## ENSG00000000938.12 2727.889866 2113.536536
Mean sample table
# function to create mean_sample_table depending on your condition called mean_sample_definition
mean_sample_table<- sample_table[!duplicated(sample_table$condition),]#mean sample table is generated firsly from sample_table
rownames(mean_sample_table)<-mean_sample_table$condition
mean_sample_table<-subset(mean_sample_table,select=-c(ID,donor))#ID and donor columns are excluded
mean_sample_table<-mean_sample_definition(input=mean_sample_table,
anno=sample_table,
condition="condition")
## [1] "Immature_moDCs"
## [1] "LPStreated_moDCs"
## [1] "Immature_IL10APCs"
## [1] "LPStreated_IL10APCs"
mean_sample_table
## condition cell_Type treatment total_reads
## Immature_moDCs Immature_moDCs moDC undiff 9786258
## LPStreated_moDCs LPStreated_moDCs moDC LPS 12241293
## Immature_IL10APCs Immature_IL10APCs IL10APC IL10 9674108
## LPStreated_IL10APCs LPStreated_IL10APCs IL10APC LPS_IL10 18533050
## pseudoaligned_reads not_pseudoaligned_reads
## Immature_moDCs 7957867 1828391
## LPStreated_moDCs 10284543 1956751
## Immature_IL10APCs 8654087 1020021
## LPStreated_IL10APCs 16387226 2145823
## percent_aligned ID
## Immature_moDCs 80.92075 Immature_moDCs
## LPStreated_moDCs 83.96828 LPStreated_moDCs
## Immature_IL10APCs 89.24532 Immature_IL10APCs
## LPStreated_IL10APCs 88.37550 LPStreated_IL10APCs
corner(mean_sample_table)
## condition cell_Type treatment total_reads
## Immature_moDCs Immature_moDCs moDC undiff 9786258
## LPStreated_moDCs LPStreated_moDCs moDC LPS 12241293
## Immature_IL10APCs Immature_IL10APCs IL10APC IL10 9674108
## LPStreated_IL10APCs LPStreated_IL10APCs IL10APC LPS_IL10 18533050
## pseudoaligned_reads
## Immature_moDCs 7957867
## LPStreated_moDCs 10284543
## Immature_IL10APCs 8654087
## LPStreated_IL10APCs 16387226
Specify mean sample annotation for plotting
# choose columns from sampletable for heatmap annotation
#consistency for writing in capital or small letters!
mean_plot_annotation <- mean_sample_table[,c("condition","pseudoaligned_reads"), drop = F]
rownames(mean_plot_annotation) <- mean_sample_table$condition
# plot_order
mean_sample_table$condition<- factor(mean_sample_table$condition,
levels=c("Immature_moDCs","LPStreated_moDCs","Immature_IL10APCs","LPStreated_IL10APCs"))
plot_order<-"condition"
TypeCounts <- as.data.frame(table(norm_anno$GENETYPE))
colnames(TypeCounts) <- c("Type", "Frequency")
TypeCounts <- subset(TypeCounts, Frequency>0)
ggplot(TypeCounts, aes(x=Type, y= Frequency, label=Frequency))+
geom_bar(stat="identity",fill="grey",colour="grey") +
theme_bw()+
geom_text(size = 3, position = position_stack(vjust = 1))+
guides(fill=FALSE)+
theme(text = element_text(size=10),axis.text.x = element_text(angle =90, hjust = 1))+
xlab("")
rMeans <- as.data.frame(log(rowMeans(counts(dds, normalized=TRUE)),10))
colnames(rMeans) <- "rowMeans"
ggplot(rMeans, aes(x = rowMeans)) +
geom_histogram(bins=100) +
xlab("log10(rowMeans)")+
scale_x_continuous(breaks=c(0,1,2,3,4,5,6))+
theme_bw()
highestGenes(numGenes = 50)
Plot a heatmap of all hierarchically clustered present genes.
I case your machine has only 8 gb of ram, plotting a heatmap of all present genes might exceed your memory capacities and give the error: “Error: cannot allocate vector of size XX Gb”. In that case, you can either reduce the genes to present based on a higher expression cutoff or continue with the heatmap of variable genes.
plotHeatmap(input=norm_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (normalized counts)",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=vst_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (log2 normalized counts)",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=norm_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (normalized counts)",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=vst_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (log2 normalized counts)",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=norm_anno ,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (normalized counts)",
show_rownames = FALSE,
cluster_cols = FALSE,
gene_type = "all")
plotHeatmap(input=vst_anno ,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (log2 normalized counts)",
show_rownames = FALSE,
cluster_cols = FALSE,
gene_type = "all")
plotHeatmap(input=norm_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (normalized counts)",
show_rownames = FALSE,
cluster_cols = FALSE,
gene_type = "all")
plotHeatmap(input=vst_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (log2 normalized counts)",
show_rownames = FALSE,
cluster_cols = FALSE,
gene_type = "all")
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 8155733 435.6 15768767 842.2 15768767 842.2
## Vcells 55058708 420.1 1262836672 9634.7 1578545839 12043.4
# define variable genes
rv = genefilter::rowVars(assay(dds_vst))
q75 = quantile(rowVars(assay(dds_vst)), .75)
q75_names = names(which(rv > q75))
plotHeatmap(input=norm_anno,
smp_table=sample_table,
geneset =q75_names,
title = "Heatmap of Top 75% most variable genes (log2 normalized counts)",
show_rownames = FALSE,
cluster_cols = TRUE,
plot_anno=plot_annotation,
gene_type = "all")
plotHeatmap(input=vst_anno,
smp_table=sample_table,
geneset =q75_names,
title = "Heatmap of Top 75% most variable genes (log2 normalized counts)",
show_rownames = FALSE,
cluster_cols = TRUE,
plot_anno=plot_annotation,
gene_type = "all")
plotHeatmap(input=norm_mean,
smp_table=mean_sample_table,
geneset =q75_names,
title = "Heatmap of Top 75% most variable genes (log2 normalized counts)",
show_rownames = FALSE,
cluster_cols = TRUE,
plot_anno=mean_plot_annotation,
gene_type = "all")
plotHeatmap(input=norm_anno,
smp_table=sample_table,
geneset = q75_names,
title = "Heatmap of all most variable genes",
show_rownames = FALSE,
cluster_cols = FALSE,
plot_anno=plot_annotation,
gene_type = "all")
plotHeatmap(input=vst_mean,
smp_table=mean_sample_table,
geneset = q75_names,
title = "Heatmap of all most variable genes",
show_rownames = FALSE,
cluster_cols = FALSE,
plot_anno=mean_plot_annotation,
gene_type="all")
plotHeatmap(input=vst_mean,
smp_table=mean_sample_table,
geneset = q75_names,
title = "Heatmap of all most variable genes",
show_rownames = FALSE,
cluster_cols = FALSE,
plot_anno=mean_plot_annotation,
gene_type="antisense_RNA")
cd_input<-as.data.frame(assay(dds_vst))
# Correlation based on variance-stabilized counts
# sampleCor <- as.matrix(cor(assay(dds_vst), use="all.obs", method="pearson"))
# rownames(sampleCor)<- sample_table$ID
# colnames(sampleCor)<- sample_table$ID
#
# pheatmap(sampleCor,
# main="Sample Correlation based on variance-stabilized counts",
# annotation_row = plot_annotation,
# annotation_col = plot_annotation,
# annotation_colors = ann_colors,
# cluster_rows = F,
# cluster_cols = F,
# fontsize = 8)
corr_function(sampleCor = cd_input,
title="Sample Correlation based on variance-stabilized counts",
gene_anno=gene_annotation,
plot_anno=plot_annotation,
gene_type="all",
cluster_rows=F,
cluster_cols=F,
mean=F)
corr_function(sampleCor = cd_input,
title="Sample Correlation based on variance-stabilized counts",
gene_anno=gene_annotation,
plot_anno=mean_plot_annotation,
gene_type="all",
cluster_rows=F,
cluster_cols=F,
mean=T)
corr_function(sampleCor = cd_input,
title="Sample Correlation based on variance-stabilized counts",
gene_anno=gene_annotation,
plot_anno=plot_annotation,
gene_type="ribozyme",
cluster_rows=F,
cluster_cols=F,
mean=F)
This function computes and returns the euclidean distance matrix between the rows of a data matrix, the samples in our case.
# Sample Distances based on variance-stabilized counts
# sampleDist <- as.matrix(dist(t(assay(dds_vst))))
# rownames(sampleDist)<- sample_table$ID
# colnames(sampleDist)<- sample_table$ID
#
# pheatmap(sampleDist,
# clustering_distance_rows = dist(t(assay(dds_vst))),
# clustering_distance_cols = dist(t(assay(dds_vst))),
# main="Sample distances based on variance-stabilized counts per sample",
# annotation_row = plot_annotation,
# annotation_col = plot_annotation,
# annotation_colors = ann_colors,
# fontsize = 8)
dist_function(sampleDist = cd_input,
gene_anno=gene_annotation,
plot_anno=plot_annotation,
title="Sample distances based on variance-stabilized counts per sample",
gene_type="all",
mean=F)
dist_function(sampleDist = cd_input,
gene_anno=gene_annotation,
plot_anno=mean_plot_annotation,
title="Sample distances based on variance-stabilized counts per sample",
gene_type="all",
mean=T)
Related to the distance matrix is the PCA plot, which shows the samples in the 2D plane spanned by two principal components. This type of plot is useful for visualizing the overall effect of experimental covariates and batch effects.
Principal component analysis (PCA) simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features. High-dimensional data are very common in biology and arise when multiple features, such as expression of many genes, are measured for each sample. This type of data presents several challenges that PCA mitigates: computational expense and an increased error rate due to multiple test correction when testing each feature for association with an outcome. PCA is an unsupervised learning method and is similar to clustering1—it finds patterns without reference to prior knowledge about whether the samples come from different treatment groups or have phenotypic differences.
To understand the basics of PCA, please watch: https://www.youtube.com/watch?v=_UVHneBUBW0
# Extract the eigenvalues/variances of the principal dimensions
eigenvalue <- get_eig(prcomp(t(assay(dds_vst))))
eigenvalue$dim <- as.numeric(c(1:nrow(eigenvalue)))
ggplot(eigenvalue, aes(dim, variance.percent))+
geom_bar(stat = "identity")+
geom_line(aes(dim, variance.percent)) +
geom_point(aes(dim, variance.percent)) +
geom_line(aes(dim, cumulative.variance.percent), colour= "grey") +
geom_point(aes(dim, cumulative.variance.percent), colour= "grey") +
scale_x_continuous(breaks=c(1:nrow(eigenvalue)))+
xlab("Dimensions") +
ylab("Percentage of explained variances") +
theme_bw()
plotPCA(ntop="all",
xPC=1,
yPC=2,
color= "condition",
anno_colour = ann_colors$condition,
shape="NULL",
point_size=3,
gene_type = "all",
label= "ID",
title="Principle Component Analysis based on variance-stabilized counts",
gene_anno = gene_annotation)
plotPCA(ntop="all",
xPC=1,
yPC=2,
color= "condition",
anno_colour = ann_colors$condition,
shape="NULL",
point_size=3,
gene_type = "protein_coding",
label= "ID",
title="Principle Component Analysis based on variance-stabilized counts",
gene_anno = gene_annotation)
plotPCA(ntop="all",
xPC=1,
yPC=2,
color="donor",
anno_colour=ann_colors$donor,
shape="NULL",
point_size=3,
label= "ID",
label_subset = "1553",
title="Principle Component Analysis based on variance-stabilized counts")
plotPCA(ntop="all",
xPC=1,
yPC=2,
color="pseudoaligned_reads",
anno_colour= ann_colors$pseudoaligned_reads,
shape="NULL",
point_size=3,
title="Principle Component Analysis based on variance-stabilized counts")
Please comment this chunk out when knitting to an HTML file because it will lead to a corrupt HTML file. #@Deniz: you can test this out as well :D
plot3D_pca(pca3d_input=dds_vst,
sample_table=sample_table,
gene_anno=gene_annotation,
gene_type="all",
xPC=1,
yPC=2,
zPC=3,
ntop="all",
anno_colour = col_condition,
point_size =10)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
plotLoadings(PC="PC1", ntop="all")
plotLoadings(PC="PC2", ntop="all")
plotLoadings(PC="PC3", ntop="all")
In the following chunk, we define the upper quartile of the variable genes and identify those genes with a high positive correlation (r>0.95).
q75_vst = assay(dds_vst)[rowVars(assay(dds_vst)) > quantile(rowVars(assay(dds_vst)), .75),]
rcor <- rcorr(t(q75_vst),type="pearson")
rcor$sig <- rcor$P<0.05 & rcor$r>0 # define significant positive correlations
rcor_filt <- rcor$r*rcor$si
rcor_filt <- rcor_filt*upper.tri(rcor_filt)
rcor_filt<- replace(rcor_filt, which( rcor_filt==0), NA)
rcor_filt_melt <- melt(rcor_filt,na.rm = TRUE)
rcor_filt_melt_cutoff <- rcor_filt_melt[rcor_filt_melt$value>0.95,]
varR <- unique(rcor_filt_melt_cutoff$Var1)
plotHeatmap(input=vst_anno,
smp_table=sample_table,
geneset = varR,
title = "Heatmap of all variable and co-expressed genes",
show_rownames = FALSE,
cluster_cols = FALSE,
plot_anno=plot_annotation,
gene_type = "all")
Plot the expression of selected genes across conditions.
plotSingleGene(data=norm_anno,
symbol="ITGAM",
condition="condition",
anno_colour=ann_colors$condition,
shape= NULL)+theme(axis.text.x = element_text(angle = 90, hjust = 1))
## [1] "Selected gene symbol assigned to one gene (Ensembl ID)"
plotSingleGene(data=norm_anno,
symbol="WDR18",
condition="condition",
anno_colour=ann_colors$condition,
shape= "donor")+theme(axis.text.x = element_text(angle = 90, hjust = 1))
## [1] "Selected gene symbol assigned to one gene (Ensembl ID)"
Plot a heatmap of genes annotated to a selected GO, KEGG or Hallmark term.
plotGeneSetHeatmap(input=norm_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
cat="GO",
term="neutrophil degranulation",
organism = "human",
show_rownames =F,
cluster_cols = F,
gene_type = "all")
plotGeneSetHeatmap(input=norm_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
cat="GO",
term="neutrophil degranulation",
organism = "human",
show_rownames =F,
cluster_cols = F,
gene_type = "protein_coding")
plotGeneSetHeatmap(input=norm_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
cat="KEGG",
term="Alzheimer disease",
organism = "human",
show_rownames =T,
cluster_cols = F,
gene_type = "all")
In case you observe a batch effect in the data, the following code can help to define and correct for known and unknown batch effects.
If you did not observe a batch effect, this section should be skipped.
In case the global analysis showed that your data suffers from a technical batch effect, e.g. your samples cluster according to a certain covariate such as “sex” or “date of experiment”, you might want to correct for this. For known batch effects, such as sequencing day or sex, you may want to try to include the variance explained by this covariate in your later differential gene expression analysis as a co-factor.
To check whether removing the variance caused by this covariate improves the clustering of your samples, you can correct your gene expression for this factor using limma and plot a PCA of the corrected data.
(https://bioconductor.org/packages/release/bioc/html/limma.html)
removedbatch_dds_vst <- limmaBatchEffectRemoval(input=dds_vst,
modelfactor = "condition",
batchfactor = "donor",
batchfactor_2 = NULL)
plotPCA(pca_input = removedbatch_dds_vst,
ntop="all",
xPC=1,
yPC=2,
color="condition",
anno_colour = ann_colors$condition,
shape="donor",
point_size=3,
title="PCA of batch-corrected counts",
gene_anno=gene_annotation,
gene_type="all")
For later plotting of Heatmaps, it may be useful to have a dataframe of the batch-corrected values as log2-transformed (removedbatch_anno_log2) as well as as normal counts (removedbatch_anno).
removedbatch_anno_log2<-removedbatch_dds_vst
removedbatch_anno_log2<-merge(removedbatch_anno_log2,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(removedbatch_anno_log2)<-removedbatch_anno_log2$Row.names
removedbatch_anno_log2$Row.names<-NULL
removedbatch_anno<-removedbatch_dds_vst
removedbatch_anno<-2^removedbatch_anno
removedbatch_anno<-merge(removedbatch_anno,norm_anno[,c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")],by="row.names")
rownames(removedbatch_anno)<-removedbatch_anno$Row.names
removedbatch_anno$Row.names<-NULL
removedbatch_mean_log2<-mean_function(input=removedbatch_anno_log2,
anno = sample_table,
condition = "condition")
corner(removedbatch_mean_log2)
## GENEID Immature_moDCs LPStreated_moDCs
## ENSG00000000003.14 ENSG00000000003.14 0.9663107 0.6747506
## ENSG00000000419.12 ENSG00000000419.12 8.1330584 8.2173440
## ENSG00000000457.13 ENSG00000000457.13 6.4973990 7.0915807
## ENSG00000000460.16 ENSG00000000460.16 5.1619892 4.7156306
## ENSG00000000938.12 ENSG00000000938.12 11.3824790 9.8007535
## Immature_IL10APCs LPStreated_IL10APCs
## ENSG00000000003.14 0.7022162 0.7023509
## ENSG00000000419.12 8.2592445 7.9952041
## ENSG00000000457.13 6.6757021 6.6051387
## ENSG00000000460.16 5.2226122 5.1857171
## ENSG00000000938.12 11.3644457 11.0233604
plotHeatmap(input=vst_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (variance stabilized counts [log2] )",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=removedbatch_anno_log2,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (batch-removed counts [log2])",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=vst_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (variance stabilized counts [log2] )",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=removedbatch_mean_log2,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = "all",
title = "Heatmap of all present genes (batch-removed counts [log2])",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
cd_input_batch<-removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]
# Correlation based on variance-stabilized counts & batch-corrected counts
# sampleCor <- as.matrix(cor(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")], use="all.obs", method="pearson"))
# rownames(sampleCor)<- sample_table$ID
# colnames(sampleCor)<- sample_table$ID
#
# pheatmap(sampleCor,
# main="Sample Correlation based on batch-corrected counts",
# annotation_row = plot_annotation,
# annotation_col = plot_annotation,
# annotation_colors = ann_colors,
# cluster_rows = T,
# cluster_cols = T,
# fontsize = 8)
corr_function(sampleCor = cd_input_batch,
title="Sample Correlation based on batch-corrected counts",
gene_anno=gene_annotation,
plot_anno=plot_annotation,
gene_type="protein_coding",
mean=F)
corr_function(sampleCor = cd_input_batch,
title="Sample Correlation based on batch-corrected counts",
gene_anno=gene_annotation,
plot_anno=plot_annotation,
gene_type="all",
mean=F)
corr_function(sampleCor = cd_input_batch,
title="Sample Correlation based on batch-corrected counts",
gene_anno=gene_annotation,
plot_anno=mean_plot_annotation,
gene_type="all",
mean=T)
# Sample Distances based on batch-corrected counts
# sampleDist <- as.matrix(dist(t(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")])))
# rownames(sampleDist)<- sample_table$ID
# colnames(sampleDist)<- sample_table$ID
#
# pheatmap(sampleDist,
# clustering_distance_rows = dist(t(assay(dds_vst))),
# clustering_distance_cols = dist(t(assay(dds_vst))),
# main="Sample distances based on batch-corrected counts per sample",
# annotation_row = plot_annotation,
# annotation_col = plot_annotation,
# annotation_colors = ann_colors,
# fontsize = 8)
dist_function(sampleDist = cd_input_batch,
gene_anno=gene_annotation,
plot_anno=plot_annotation,
title="Sample distances based on batch-corrected counts per sample",
gene_type="all",
mean=F)
dist_function(sampleDist = cd_input_batch,
gene_anno=gene_annotation,
plot_anno=mean_plot_annotation,
title="Sample distances based on batch-corrected counts per sample",
gene_type="all",
mean=T)
Plot the expression of selected genes across conditions.
plotSingleGene(data=removedbatch_anno,
symbol="WDR18",
condition="condition",
anno_colour=ann_colors$condition,
shape= "donor")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## [1] "Selected gene symbol assigned to one gene (Ensembl ID)"
In case your samples do not cluster according to the condition of biological interested, but you observe distinct clustering according to an unknown latent variable, you may want to try identifying this “hidden”" batch variable using surrogate variable analysis (SVA).
The SVA package helps to define the variance in the data set that is not explained by your variables of interest and tries to model the respective surrogate variables. These surrogate variables can be included as factors in your DESeq2 model. (http://master.bioconductor.org/packages/release/workflows/html/rnaseqGene.html#removing-hidden-batch-effects; http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161)
The goal of the sva is to remove all unwanted sources of variation while protecting the contrasts due to the primary variables. This leads to the identification of features that are consistently different between groups, removing all common sources of latent variation. In some cases, the latent variables may be important sources of biological vari- ability. If the goal of the analysis is to identify heterogeneity in one or more subgroups, the sva function may not be appropriate.
The first step in using the sva package is to properly format the data and create appropriate model matrices. The data should be a matrix with features (genes, transcripts, voxels) in the rows and samples in the columns. Below we obtain a matrix of normalized counts for which the average count across samples is larger than 1. This is the typical genes by samples matrix found in gene expression analyses. The sva package assumes there are two types of variables that are being considered: (1) adjustment variables and (2) variables of interest. For example, in a gene expression study the variable of interest might an indicator of cancer versus control. The adjustment variables could be the age of the patients, the sex of the patients, and a variable like the date the arrays were processed.
Two model matrices must be made: the “full model” and the “null model”. The null model is a model matrix that includes terms for all of the adjustment variables but not the variables of interest. The full model includes terms for both the adjustment variables and the variables of interest. The assumption is that you will be trying to analyze the association between the variables of interest and gene expression, adjusting for the adjustment variables. The model matrices can be created using the model.matrix() function.
After the model matrices have been created, we can apply the svaseq function to estimate batch and other artifacts.
The svaseq function performs two different steps. First it identifies the number of latent factors that need to be estimated. If the sva function is called without the n.sv argument specified, the number of factors will be estimated for you. If you prefer a specific number of SVs, you can set the n.sv parameter accordingly.
# Format and filter the input
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
# Create the full model matrix - including both the adjustment variables and the variable of interest. In this case we only have one variable of interest called Genotype_Age.
mod <- model.matrix(~ condition, colData(dds))
# The null model contains only the adjustment variables. Since we are not adjusting for any other variables in this analysis, only an intercept is included in the model.
mod0 <- model.matrix(~ 1, colData(dds))
Apply the svaseq() function to estimate the surrogate variables:
The svaseq function returns a list with four components, sv, pprob.gam, pprob.b, n.sv:
sv is a matrix whose columns correspond to the estimated surrogate variables. They can be used in downstream analyses as described below. pprob.gam is the posterior probability that each gene is associated with one or more latent variables. pprob.b is the posterior probability that each gene is associated with the variables of interest. n.sv is the number of surrogate variables estimated by the sva.
svseq <- svaseq(dat,
mod,
mod0)
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
svseq$sv
## [,1] [,2]
## [1,] 0.0004899811 -0.44975950
## [2,] -0.1637062597 -0.20697600
## [3,] -0.0656291476 -0.46170196
## [4,] -0.0884280508 -0.42728237
## [5,] -0.3422290945 0.37945345
## [6,] -0.3188615390 0.28345666
## [7,] -0.2814525092 0.15177991
## [8,] -0.2712153647 0.18149453
## [9,] 0.3800205381 0.16048612
## [10,] 0.4188261910 0.16006204
## [11,] 0.3946231423 0.08964198
## [12,] 0.3375621129 0.13934513
You can display whether the calculated SV correlate with any of your known covariates by changing dds$condition to the column of your sample_table that you are interested in.
par(mfrow = c(2, ncol(svseq$sv)))
condition ="donor"
for (i in 1:ncol(svseq$sv)) {
stripchart(svseq$sv[, i] ~ dds[[condition]], vertical = TRUE, main = paste0("SV", i))
abline(h = 0)
}
Add surrogate variables to annotation table to re-analyse the data including the surrogate variables in the analysis.
for (i in 1:ncol(svseq$sv)) {
sample_table[[paste0("SV",i)]]<- svseq$sv[,i]
}
Again, we can check in a PCA what an influence removing the SVs has on the clustering of the samples.
removedbatch_dds_vst <- limmaBatchEffectRemoval(input=dds_vst,
modelfactor = "condition",
batchfactor = c("SV1","SV2","SV3"),
batchfactor_2 = NULL)
plotPCA(pca_input = removedbatch_dds_vst,
ntop="all",
xPC=1,
yPC=2,
color="condition",
anno_colour = ann_colors$condition,
point_size=3,
title="PCA of batch-corrected counts",
gene_anno=gene_annotation,
gene_type="all")
You can also use the batch-corrected table as input for all heatmaps and as input for the top varying genes as displayed in the following section:
plotHeatmap(input=removedbatch_anno_log2,
smp_table = sample_table,
plot_anno = plot_annotation,
gene_type = "all",
geneset = "all",
title = "Heatmap of all present genes",
show_rownames = FALSE,
cluster_cols = TRUE)
rv = genefilter::rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")])
q75 = quantile(rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]), .75)
q75_names = names(which(rv > q75))
# clustered Columns & Rows
plotHeatmap(input=removedbatch_anno_log2,
geneset = q75_names,
smp_table = sample_table,
plot_anno = plot_annotation,
gene_type = "all",
title = "Heatmap of all most variable genes",
show_rownames = FALSE,
cluster_cols = TRUE)
cd_batch_log2<-removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]
# sampleCor <- as.matrix(cor(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")], use="all.obs", method="pearson"))
# rownames(sampleCor)<- sample_table$ID
# colnames(sampleCor)<- sample_table$ID
# pheatmap(sampleCor,
# main="Sample Correlation based on variance-stabilized counts",
# annotation_row = plot_annotation,
# annotation_col = plot_annotation,
# annotation_colors = ann_colors,
# cluster_rows = F,
# cluster_cols = F,
# fontsize = 8)
corr_function(sampleCor = cd_batch_log2,
title="Sample Correlation based on variance-stabilized counts",
gene_anno=gene_annotation,
plot_anno=plot_annotation,
gene_type="all",
cluster_rows = F,
cluster_cols = F,
mean=F)
# sampleDist <- as.matrix(dist(t(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")])))
# rownames(sampleDist)<- sample_table$ID
# colnames(sampleDist)<- sample_table$ID
# pheatmap(sampleDist,
# clustering_distance_rows = dist(t(assay(dds_vst))),
# clustering_distance_cols = dist(t(assay(dds_vst))),
# main="Sample distances based on variance-stabilized counts per sample",
# annotation_row = plot_annotation,
# annotation_col = plot_annotation,
# annotation_colors = ann_colors,
# fontsize = 8)
dist_function(sampleDist = cd_batch_log2,
gene_anno=gene_annotation,
plot_anno=plot_annotation,
title="Sample distances based on variance-stabilized counts per sample",
gene_type="all",
mean=F)
q75_vst = removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")][rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]) > quantile(rowVars(removedbatch_anno_log2[,!colnames(removedbatch_anno_log2)%in%c("GENEID","SYMBOL","GENETYPE","DESCRIPTION","CHR")]), .75),]
rcor <- rcorr(t(q75_vst),type="pearson")
rcor$sig <- rcor$P<0.05 & rcor$r>0 # define significant positive correlations
rcor_filt <- rcor$r*rcor$si
rcor_filt <- rcor_filt*upper.tri(rcor_filt)
rcor_filt<- replace(rcor_filt, which( rcor_filt==0), NA)
rcor_filt_melt <- melt(rcor_filt,na.rm = TRUE)
rcor_filt_melt_cutoff <- rcor_filt_melt[rcor_filt_melt$value>0.95,]
varR <- unique(rcor_filt_melt_cutoff$Var1)
plotHeatmap(input=removedbatch_anno_log2,
geneset = varR,
smp_table = sample_table,
plot_anno = plot_annotation,
gene_type = "all",
title = "Heatmap of all variable and co-expressed genes",
show_rownames = FALSE,
cluster_cols = FALSE)
In case you want to include the observed batch variables in your DESeq2 model, no matter if known covariates or surrogate variables, you can add these to your design formlua in front of the condition of interest and recalculate your dds object. In the following expample we will include a Sex and three SVs as batch effects.
ATTENTION: Skip this chunk, if you do not want to include any batch variables!
# ddssva <- dds
# ddssva$SV1 <- svseq$sv[,1]
# ddssva$SV2 <- svseq$sv[,2]
# ddssva$SV3 <- svseq$sv[,3]
# design(ddssva) <- ~ SV1 + SV2 + SV3 + Genotype_Age
#
# dds <- DESeq(ddssva)
After the DESeq() function performs the standard differential expression analysis steps, DESeq2’s results() function can extract a result table from the DESeqDataSet giving base means across samples, log2 fold changes, standard errors, test statistics, p-values and adjusted p-values.
We have written a function called DEAnalysis() that runs DESeq2’s results() and the lfcShrink() function with specified parameters on a set of pre-defined comparisons and returns a single DEresults object containing the respective result tables together with additional lists of the significant DE genes and the number of DE genes found.
The parameters of the DEAnalysis function are:
condition: specify the condition that you are testing, e.g. treatment or genotype. This value must correspond to the column in the colData listing the factors you are comparing and the design formula.
alpha: a significance cutoff used for optimizing the independent filtering.(default= 0.05)
lfcThreshold: a non-negative value which specifies a log2 fold change threshold. The default value is 0, corresponding to a test that the log2 fold changes are equal to zero. (default = 0)
sigFC: post testing significance criteria as a non-negative, non-log transformed FC cutoff (default = 2)
multiple_testing: By default independent hypothesis weighting will be used as the multiple testing method (https://bioconductor.org/packages/3.7/bioc/vignettes/IHW/inst/doc/introduction_to_ihw.html). However, you can also edit the multiple testing method by setting the multiple_testing parameter to “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,or “fdr”, which will perform independent filtering and p-value adjustment according to the specified method. (default = “IHW”)
shrinkage: After calculating differential expression statistics, we can perform a so-called log2 fold change shrinkage. This shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes, as it removes the noise associated with log2 fold changes from low count genes without requiring arbitrary filtering thresholds. To shrink the LFC, set shrinkage = TRUE to pass the dds object to the function lfcShrink. (default = TRUE)
shrinkType: The options for the shrinkage type are:
“normal” is the the original DESeq2 shrinkage estimator, an adaptive Normal distribution as prior.This is currently the default, although the default will likely change to apeglm in the October 2018 release given apeglm’s superior performance.
“apeglm” is the adaptive t prior shrinkage estimator from the apeglm package (Zhu, Ibrahim, and Love 2018).
“ashr” is the adaptive shrinkage estimator from the ashr package (Stephens 2016). Here DESeq2 uses the ashr option to fit a mixture of Normal distributions to form the prior, with method=“shrinkage”.
Define your comparisons in a data.frame with “comparison” in the first column and “control” in the second column.
| comparison | control |
|---|---|
| LPStreated_moDCs | Immature_moDCs |
| Immature_IL10APCs | Immature_moDCs |
| LPStreated_IL10APCs | Immature_IL10APCs |
comparison_table<-data.frame(comparison = c("LPStreated_moDCs","Immature_IL10APCs","LPStreated_IL10APCs"),
control = c("Immature_moDCs","Immature_moDCs","Immature_IL10APCs"))
DEresults <- DEAnalysis(condition = "condition",
alpha=0.05 ,
lfcThreshold= 0,
sigFC = 2,
multiple_testing="IHW",
shrinkage = TRUE,
shrinkType="normal")
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
##
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
##
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
## using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
##
## Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
## See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
## Reference: https://doi.org/10.1093/bioinformatics/bty895
We use a for loop to print the number of significantly up- and down-regulated genes over all comparisons.
DEcounts <- NULL
for(i in 1:nrow(comparison_table)){
tmp <- unlist(DEresults[[1+i]]@Number_DE_genes)
DEcounts <- rbind(DEcounts, tmp)
}
rownames(DEcounts) <- names(DEresults)[-1]
DEcounts
## up_regulated_Genes
## LPStreated_moDCs_vs_Immature_moDCs 1770
## Immature_IL10APCs_vs_Immature_moDCs 739
## LPStreated_IL10APCs_vs_Immature_IL10APCs 137
## down_regulated_Genes
## LPStreated_moDCs_vs_Immature_moDCs 1711
## Immature_IL10APCs_vs_Immature_moDCs 680
## LPStreated_IL10APCs_vs_Immature_IL10APCs 30
# the uDEG() function produces the union of the DE genes from the specified comparisons
allDEgenes <- uDEG(comparisons=c("LPStreated_moDCs_vs_Immature_moDCs",
"Immature_IL10APCs_vs_Immature_moDCs",
"LPStreated_IL10APCs_vs_Immature_IL10APCs"))
plotHeatmap(geneset = allDEgenes,
smp_table = sample_table,
plot_anno = plot_annotation,
gene_type = "all",
title = "Heatmap of all differentially expressed genes",
show_rownames = FALSE,
cluster_cols = TRUE)
if(organism=="mouse"){
DE_TF <- allDEgenes[which(allDEgenes %in% norm_anno[norm_anno$SYMBOL %in% TFlist_mm,]$GENEID)]
} else if(organism =="human"){
DE_TF <- allDEgenes[which(allDEgenes %in% norm_anno[norm_anno$SYMBOL %in% TFlist_hs,]$GENEID)]
}
plotHeatmap(geneset = DE_TF,
smp_table = sample_table,
plot_anno = plot_annotation,
gene_type = "all",
title = "Heatmap of all differentially expressed transcription factors",
show_rownames = TRUE,
cluster_cols = FALSE)
plotVenn(comparisons = c("LPStreated_moDCs_vs_Immature_moDCs",
"Immature_IL10APCs_vs_Immature_moDCs"),
regulation ="up")
plotVenn(comparisons = c("LPStreated_moDCs_vs_Immature_moDCs",
"Immature_IL10APCs_vs_Immature_moDCs"))
Ratio-Ratio plots help to compare the DE genes of two comparisons and are much more expressive than venn diagrams.
plotRatios(comp1 = "LPStreated_moDCs_vs_Immature_moDCs",
comp2 = "Immature_IL10APCs_vs_Immature_moDCs")
plotFCrank(comp1 = "LPStreated_moDCs_vs_Immature_moDCs",
comp2 = "Immature_IL10APCs_vs_Immature_moDCs")
Define universe and gene sets for subsequent GSEA analyses.
# define universe
universe <- as.character(norm_anno$SYMBOL)
# change symbols to ENTREZ IDs (necessary for ClusterProfiler)
universe_Entrez <- bitr(universe,
fromType="SYMBOL",
toType="ENTREZID",
OrgDb="org.Hs.eg.db")$ENTREZID
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(universe, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## "org.Hs.eg.db"): 24.2% of input gene IDs are fail to map...
universe_mouse2human <- getLDS(attributes = c("mgi_symbol"),
filters = "mgi_symbol",
values = universe,
mart = useMart("ensembl", dataset = "mmusculus_gene_ensembl"),
attributesL = c("hgnc_symbol"),
martL = useMart("ensembl", dataset = "hsapiens_gene_ensembl"),
uniqueRows=T)[,2]
universe_mouse2human_Entrez <- bitr(universe_mouse2human,
fromType="SYMBOL",
toType="ENTREZID",
OrgDb="org.Hs.eg.db")$ENTREZID
## 'select()' returned 1:many mapping between keys and columns
## Warning in bitr(universe_mouse2human, fromType = "SYMBOL", toType =
## "ENTREZID", : 0.22% of input gene IDs are fail to map...
Next, we perform functional enrichment analysis based on Gene ontology and KEGG pathway enrichment across all comparisons tested to check for overlap in functional indications of the differentially regulated genes.
DEcompare <- compareGSEA(comparisons = c("LPStreated_moDCs_vs_Immature_moDCs",
"Immature_IL10APCs_vs_Immature_moDCs",
"LPStreated_IL10APCs_vs_Immature_IL10APCs"),
organism = organism,
GeneSets = c("GO", "KEGG"),
pCorrection = "bonferroni",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
showMax = 20,
ontology = "BP")
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 9.24% of input gene IDs are fail to map...
## Warning in bitr(DE_down, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 8.66% of input gene IDs are fail to map...
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 7.32% of input gene IDs are fail to map...
## Warning in bitr(DE_down, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 5.9% of input gene IDs are fail to map...
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 3.68% of input gene IDs are fail to map...
## [1] "Performing GO enrichment"
## [1] "Performing KEGG enrichment"
DEcompare$GOplot+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
DEcompare$KEGGplot+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
DESeq2 offers two kinds of hypothesis tests: the Wald test, where we use the estimated standard error of a log2 fold change to test if it is equal to zero, and the likelihood ratio test (LRT). The LRT examines two models for the counts, a full model with a certain number of terms and a reduced model, in which some of the terms of the full model are removed. The test determines if the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero.
The LRT is therefore useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.
The likelihood ratio test can be performed by specifying test=“LRT” when using the DESeq function, and providing a reduced design formula, e.g. one in which a number of terms from design(dds) are removed. The degrees of freedom for the test is obtained from the difference between the number of parameters in the two models. A simple likelihood ratio test, if the full design was ~Genotype_Age would look like:
dds_lrt <- DESeq(dds,
test="LRT",
reduced=~1)
## using pre-existing normalization factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res_lrt <- results(dds_lrt)
lrt_anno<- as.data.frame(assay(dds_lrt))
# add gene annotation extracted from the gtf file
gene_annotation <- gene_annotation[match(rownames(lrt_anno), gene_annotation$GENEID),]
# check if row names of the lrt_anno and the gene annotation match perfectly
all(rownames(lrt_anno) == gene_annotation$GENEID)
## [1] TRUE
lrt_anno$GENEID<-rownames(lrt_anno)
lrt_anno <- merge(lrt_anno,
gene_annotation,
by = "GENEID")
rownames(lrt_anno)<-lrt_anno$GENEID
lrt_anno[1:3,c(1:2, (ncol(lrt_anno)-5):ncol(lrt_anno))]
## GENEID 1553 1567 1568 SYMBOL GENETYPE
## ENSG00000000003.14 ENSG00000000003.14 0 1 1 TSPAN6 protein_coding
## ENSG00000000419.12 ENSG00000000419.12 182 282 381 DPM1 protein_coding
## ENSG00000000457.13 ENSG00000000457.13 48 88 131 SCYL3 protein_coding
## DESCRIPTION
## ENSG00000000003.14 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000419.12 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457.13 SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## CHR
## ENSG00000000003.14 X
## ENSG00000000419.12 20
## ENSG00000000457.13 1
Creates mean for lrt_anno
lrt_mean<-mean_function(input=lrt_anno,
anno=sample_table,
condition="condition")
# Plot heatmap of significantly different genes
lrt_significantGenes <- rownames(res_lrt[!is.na(res_lrt$padj)&
res_lrt$padj < 0.05 ,])
plotHeatmap(input=norm_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = lrt_significantGenes,
title = paste("Heatmap of all LRT-significant genes:", length(lrt_significantGenes),"(normalized values)", sep=""),
show_rownames = FALSE,
cluster_cols = T,
gene_type = "all")
plotHeatmap(input=norm_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = lrt_significantGenes,
title = paste("Heatmap of all LRT-significant genes:", length(lrt_significantGenes),"(normalized values)", sep=""),
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
# Question to Steffi & Jonas:
#Do we want to plot the normalized counts for the heatmap or the LRT results(IDK in how far they vary from the normalization)
plotHeatmap(input=lrt_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = lrt_significantGenes,
title = paste("Heatmap of all LRT-significant genes:", length(lrt_significantGenes),"(LRT-values)", sep=""),
show_rownames = FALSE,
cluster_cols = T,
gene_type = "all")
plotHeatmap(input=lrt_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = lrt_significantGenes,
title = paste("Heatmap of all LRT-significant genes:", length(lrt_significantGenes),"(LRT-values)", sep=""),
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
# Plot heatmap of 1000 top significant genes
lrt_1000mostvariableGenes <- head(rownames(res_lrt[order(res_lrt$padj),]),1000)
plotHeatmap(input=norm_anno,
smp_table = sample_table,
plot_anno = plot_annotation,
geneset = lrt_1000mostvariableGenes,
title = "Heatmap of 1000 top significant LRT genes",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
plotHeatmap(input=norm_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
geneset = lrt_1000mostvariableGenes,
title = "Heatmap of 1000 top significant LRT genes",
show_rownames = FALSE,
cluster_cols = TRUE,
gene_type = "all")
Now that we have had a global look at the differentially expressed genes, we want to have a more detailed look at specific comparisons. Therefore, we use various plots to get an impression of the differences in the specified comparisons.
A MAplot represents the average log expression versus the average ratio (or fold change) between two conditions.
plotMA(comparison = "LPStreated_moDCs_vs_Immature_moDCs",
ylim=c(-2,2))
We plot a histogram of the p-values resulting from testing the specified comparisons.
To understand p-value distributions, this source is very helpful: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/
plotPvalues(comparison="LPStreated_moDCs_vs_Immature_moDCs")
Plot heatmaps of the expression of the DE genes of the specified comparisons. The argument conditions allows to include samples from specified conditions in the heatmap.
plotDEHeatmap(comparison = "LPStreated_moDCs_vs_Immature_moDCs",
factor="condition",
conditions="all",
gene_type="all",
show_rownames = FALSE,
cluster_cols = F)
plotDEHeatmap(comparison = "Immature_IL10APCs_vs_Immature_moDCs",
factor="condition",
conditions=c("Immature_moDCs","Immature_IL10APCs", "LPStreated_moDCs"),
gene_type="all",
show_rownames = FALSE,
cluster_cols = F)
A volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets composed of replicate data. It plots significance versus fold-change on the y and x axes, respectively.
# Plot Volcano Plot
plotVolcano(comparison= "LPStreated_moDCs_vs_Immature_moDCs",
labelnum=20)
Next, we perform gene set enrichment analyses of the DE genes of the specified comparisons.
6 gene sets are available: * GO * KEGG * MSigDB Hallmark (H) * MSigDB Cannonical Pathways (C2) Gene sets from pathway databases. Usually, these gene sets are canonical representations of a biological process compiled by domain experts.
* MSigDB Motifs (C3, DNA binding motifs) Gene sets representing potential targets of regulation by transcription factors or microRNAs. The sets consist of genes grouped by short sequence motifs they share in their non-protein coding regions. The motifs represent known or likely cis-regulatory elements in promoters and 3’-UTRs. The C3 collection is divided into two sub-collections: MIR and TFT
* MSigDB Immunologic gene sets (C7) Gene sets that represent cell states and perturbations within the immune system. The signatures were generated by manual curation of published studies in human and mouse immunology.
GSEA_LPStreated_moDCs_vs_Immature_moDCs <- GSEA(comparison="LPStreated_moDCs_vs_Immature_moDCs",
organism = "human",
GeneSets = c("GO",
"KEGG",
"Hallmark",
"canonicalPathways",
"Motifs",
"ImmunoSignatures"),
GOntology = "BP",
pCorrection = "bonferroni", # choose the p-value adjustment method
pvalueCutoff = 0.05, # set the unadj. or adj. p-value cutoff (depending on correction method)
qvalueCutoff = 0.05, # set the q-value cutoff (FDR corrected)
showMax = 20,
font.size = 8)
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(DE_up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 9.24% of input gene IDs are fail to map...
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(DE_down, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## OrgDb): 8.66% of input gene IDs are fail to map...
## [1] "Performing GO enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
## [1] "Performing KEGG enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(genes_up_hsa, fromType = "SYMBOL", toType = "ENTREZID", :
## 0.08% of input gene IDs are fail to map...
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(genes_down_hsa, fromType = "SYMBOL", toType = "ENTREZID", :
## 0.22% of input gene IDs are fail to map...
## [1] "Performing Hallmark enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
## [1] "Performing Motif enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## [1] "Performing immunesignature enrichment"
## wrong orderBy parameter; set to default `orderBy = "x"`
## wrong orderBy parameter; set to default `orderBy = "x"`
GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOup_plot
GSEA_LPStreated_moDCs_vs_Immature_moDCs$KEGGup_plot
GSEA_LPStreated_moDCs_vs_Immature_moDCs$HALLMARKup_plot
GSEA_LPStreated_moDCs_vs_Immature_moDCs$ImmSigup_plot
Please make sure that the specified terms are enriched terms of the selected results.
plotGSEAHeatmap(GSEA_result = GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOdown,
GeneSet="GO",
term = "SRP-dependent cotranslational protein targeting to membrane",
show_rownames = TRUE,
cluster_cols = F,
gene_type = "all")
plotGSEAHeatmap(GSEA_result = GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOup,
GeneSet="GO",
term = "response to virus",
show_rownames = TRUE,
cluster_cols = F,
gene_type = "protein_coding")
plotGSEAHeatmap(input=norm_mean,
smp_table = mean_sample_table,
plot_anno = mean_plot_annotation,
GSEA_result = GSEA_LPStreated_moDCs_vs_Immature_moDCs$GOup,
GeneSet="GO",
term = "response to virus",
show_rownames = TRUE,
cluster_cols = F,
gene_type = "all")
Exporting the annotated, normalized expression table:
write.table(norm_anno,
paste(dir, "/Analysis/", "Tables/", "DESeq2_norm_anno_", Sys.Date(), ".txt", sep = ""),
sep = "\t",
quote = F,
row.names = F)
As an optimal output for cooperation partners, we create an Excel workbook with the normalized count table, the rlog-transformed counts, the DE test parameters & the statistics of the respective differential expression tests.
# create Workbook
ExcelOutput<-createWorkbook()
# add sample table
sheet <- addWorksheet(ExcelOutput, sheetName = "Samples")
writeDataTable(ExcelOutput, sheet, sample_table, withFilter=FALSE)
# add normalized counts
tmp <- norm_anno
tmp$LRT_padj <- res_lrt$padj
sheet <- addWorksheet(ExcelOutput, sheetName = "Normalized counts & Annotation")
writeDataTable(ExcelOutput, sheet, tmp, withFilter=FALSE)
# add variance-stabilized counts
tmp <- as.data.frame(assay(dds_vst))
tmp$GENEID <- norm_anno$GENEID
tmp$SYMBOL <- norm_anno$SYMBOL
sheet <- addWorksheet(ExcelOutput, sheetName = "Variance-stabilized counts")
writeDataTable(ExcelOutput, sheet, tmp, withFilter=FALSE)
# add DE test parameters
tmp <- stack(unlist(DEresults$parameters))
colnames(tmp)<-c("value","parameter")
tmp <- rbind(tmp, data.frame(value = as.character(design(dds))[2], parameter = "design"))
sheet <- addWorksheet(ExcelOutput, sheetName = "DE parameters")
writeDataTable(ExcelOutput, sheet, tmp, withFilter=FALSE)
# add combined DE results
cResults <- NULL
for (i in 2:length(names(DEresults))) {
if(i == 2){
cResults<-data.frame(DEresults[[i]]@results[,c("GENEID","SYMBOL","baseMean","log2FoldChange","padj","regulation")])
colnames(cResults)[4:6]<- paste(names(DEresults)[i],colnames(cResults)[4:6],sep="_")
}else{
tmp<-data.frame(DEresults[[i]]@results[,c("log2FoldChange","padj","regulation")])
colnames(tmp)<- paste(names(DEresults)[i],colnames(tmp),sep="_")
cResults<-cbind(cResults,tmp)
}
}
sheet <- addWorksheet(ExcelOutput, sheetName = "combined DE Results")
writeDataTable(ExcelOutput, sheet, cResults, withFilter=FALSE)
# add DE results in single sheets
for(i in 2:length(names(DEresults))){
gc()
sheet <- addWorksheet(ExcelOutput, sheetName = substr(names(DEresults[i]),1 , 30))
writeDataTable(ExcelOutput, sheet, DEresults[[i]]@results, withFilter=FALSE)
sheet <- addWorksheet(ExcelOutput, sheetName = paste(substr(names(DEresults[i]),1 , 22),"_upDEGs"))
writeDataTable(ExcelOutput, sheet, DEresults[[i]]@DE_genes$up_regulated_Genes, withFilter=FALSE)
sheet <- addWorksheet(ExcelOutput, sheetName = paste(substr(names(DEresults[i]),1 , 20),"_downDEGs"))
writeDataTable(ExcelOutput, sheet, DEresults[[i]]@DE_genes$down_regulated_Genes, withFilter=FALSE)
}
# Save Workbook
filename <- paste(dir,"/Analysis/Tables/AnalysisOutput_",gsub(":","-",as.character(Sys.time())),".xlsx",sep="")
saveWorkbook(ExcelOutput, file=filename)
save.image(paste(dir, "/Analysis/", Sys.Date(), "_Image.RData", sep = ""))
sessionInfo()